fix: valgrind on ARM #21
Draft
not-matthias wants to merge 22 commits into
…l JSON
New Rust crate (edition 2024) that reads a Callgrind .out profile and extracts call-graph topology (costs/addresses ignored), serializing to canonical index-ref JSON for stable cross-platform callgraph diffing.
Node identity is the {object,file,function} tuple so same-named statics stay distinct. Edges are emitted only on calls= lines (cl-format.xml CallSpec); name compression across three ID spaces, the cfl/cfi alias, inline fi/fe callee-context inheritance, and multi-part merge are handled. 18 integration tests; clippy and rustfmt clean.
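The name-compression handling described above can be sketched as follows. This is a minimal illustration, not the crate's actual code: `Space`, `NameTable`, `resolve`, and `space_of` are hypothetical names. It assumes the Callgrind convention that `(id) name` defines a mapping, a bare `(id)` refers back to it, and that object, file, and function IDs are numbered independently, with `cfl` and `cfi` sharing the file space.

```rust
use std::collections::HashMap;

/// The three independent ID spaces (hypothetical enum for this sketch).
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
enum Space {
    Object,
    File,
    Function,
}

/// Maps a spec key to its ID space; `cfl` and `cfi` are treated as aliases.
fn space_of(key: &str) -> Option<Space> {
    match key {
        "ob" | "cob" => Some(Space::Object),
        "fl" | "fi" | "fe" | "cfl" | "cfi" => Some(Space::File),
        "fn" | "cfn" => Some(Space::Function),
        _ => None,
    }
}

#[derive(Default)]
struct NameTable {
    names: HashMap<(Space, u32), String>,
}

impl NameTable {
    /// Resolves the value of a spec line: "(3) main", "(3)", or "main".
    /// Returns None for a reference to an ID that was never defined.
    fn resolve(&mut self, space: Space, value: &str) -> Option<String> {
        let value = value.trim();
        if let Some(rest) = value.strip_prefix('(') {
            if let Some((id, name)) = rest.split_once(')') {
                // Only a numeric "(id)" is compression; "(below main)" is a name.
                if let Ok(id) = id.parse::<u32>() {
                    let name = name.trim();
                    if name.is_empty() {
                        return self.names.get(&(space, id)).cloned();
                    }
                    self.names.insert((space, id), name.to_string());
                    return Some(name.to_string());
                }
            }
        }
        Some(value.to_string())
    }
}
```

Keying the table on `(Space, u32)` keeps `fn=(1)` and `fl=(1)` from colliding, which is what lets the `{object,file,function}` tuple stay a reliable node identity.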
Add testdata/*.c fixtures (recursion, chain, diamond, mutual) profiled by the in-repo Callgrind through an rstest harness that compiles each fixture and runs vg-in-place, then snapshots the canonical JSON. --instr-atstart=no plus the fixtures' client requests keep loader/libc frames out, so the JSON is stable across platforms.
The AArch64 B{L} decoder tagged the whole opcode group as Ijk_Call,
but only BL (bit 31 = 1, writes the link register) is a call; a plain
B (bit 31 = 0) is an ordinary unconditional branch.
Mislabelling B as a call made Callgrind treat every branch to a
function epilogue or tail target as a call. At -O0 a conditional like
`return n < 2 ? n : fib(...)` compiles the base case to `b <epilogue>`,
so each base case was counted as a recursive call -- inflating
recursive/cyclic call graphs and inventing phantom self-edges on arm64
(e.g. fib recursion 64 -> 98; mutual is_even/is_odd gaining self-loops).
Align plain B with B.cond and the register-indirect JMP, which already
use Ijk_Boring. Fixes the callgrind-utils recursion/mutual snapshot
failures.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
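For illustration only, the B-versus-BL distinction can be sketched in Rust; the actual fix lives in the C decoder. `JumpKind` and `classify_b_bl` are hypothetical names that mirror `Ijk_Boring` and `Ijk_Call`. The sketch relies on the A64 encoding in which bits 30:26 equal `0b00101` for both instructions and bit 31 selects BL.

```rust
/// Hypothetical stand-in for the jump kinds named in the commit message.
#[derive(Debug, PartialEq, Eq)]
enum JumpKind {
    Boring, // ordinary branch, like Ijk_Boring
    Call,   // writes the link register, like Ijk_Call
}

/// Classifies an A64 instruction word if it is an unconditional
/// immediate branch (B or BL); returns None for anything else.
fn classify_b_bl(insn: u32) -> Option<JumpKind> {
    // Bits 30:26 == 0b00101 identify the B/BL opcode group.
    if ((insn >> 26) & 0x1F) != 0b00101 {
        return None;
    }
    // Bit 31 distinguishes BL (1) from plain B (0).
    if (insn >> 31) == 1 {
        Some(JumpKind::Call)
    } else {
        Some(JumpKind::Boring)
    }
}
```

With this split, `0x14000000` (a plain `b`) classifies as `Boring` and `0x94000000` (a `bl`) as `Call`, while `b.cond` and `ret` fall outside the group entirely.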
Add a fixture_full_trace rstest matrix over the same four fixtures, traced with --instr-atstart=yes so the whole program (loader, libc startup, main's own entry) is captured, not just the client-request scoped region.
The startup frames carry non-portable names (__libc_start_main@@GLIBC_2.34, raw loader addresses), so this asserts version-stable invariants rather than a golden snapshot: JSON round-trips, main appears as a callee (full-program capture), the fixture's own functions are present, and the per-fixture call shape matches the scoped snapshots.
The recursion count (fib'2->fib'2 == 64) and mutual no-self-edge checks double as regression guards for the arm64 B-vs-BL jump-kind fix.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Profile a Python workload (recursion.py) live under the in-repo Callgrind, mirroring pytest-codspeed: a ctypes-loaded shim (clgctl.c) fires CALLGRIND_START/STOP and adds libpython + the python executable to the obj-skip list at runtime via CALLGRIND_ADD_OBJ_SKIP. Callgrind never names Python-level frames, so the test asserts structure rather than a golden snapshot: the START shim is captured and the Python runtime is folded out.
Merging this PR will degrade performance by 24.12%
Warning: Please fix the performance issues or acknowledge them on CodSpeed.
Add CallGraph::to_flamegraph / to_flamegraph_file mirroring the existing to_json API, rendering a flamegraph SVG via the inferno crate. To weight frames by cost, the parser now captures per-function self cost and per-edge inclusive cost from the positions:/events: layout (first event column, e.g. Ir); costs live outside Node so identity/dedup is unchanged. redact() re-keys self costs onto redacted identities. Folding walks roots top-down, distributing each function's aggregated self cost across incoming paths in proportion to call inclusive cost; recursion and cycles are terminated via an on-path guard and budget pruning.
…megraph
Folding distributed a node's self cost by budget/incl, where the budget came from the incoming call edge's inclusive cost. Under --instr-atstart=no the frame that was already running when instrumentation began (e.g. the CPython eval loop around a CodSpeed measured region) is entered by a call that predates measurement, so its incoming edge carries ~zero inclusive cost. Its huge self cost was then scaled to ~0 and dropped, leaving a flamegraph that summed to a few hundred instructions instead of the billions collected.
Treat the inclusive cost a node's recorded callers do not account for (incl - sum of incoming edge inclusive) as root budget, so such frames become de-facto roots and surface at full weight. Conservation-respecting graphs are unaffected (uncovered budget is zero for genuine non-roots).
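The uncovered-budget rule (incl - sum of incoming edge inclusive) can be written as a small function. This is a sketch under assumptions: `uncovered_budget` is a hypothetical name, and clamping at zero with saturating arithmetic is a guess at how over-covered nodes (for example, in recursive graphs) would be handled rather than something the commit states.

```rust
/// Inclusive cost of a node that none of its recorded callers account for.
/// A frame entered before instrumentation began has near-zero incoming
/// edge cost, so almost all of its inclusive cost becomes root budget.
fn uncovered_budget(inclusive: u64, incoming_edge_inclusive: &[u64]) -> u64 {
    let covered = incoming_edge_inclusive
        .iter()
        .copied()
        .fold(0u64, u64::saturating_add);
    // Assumption: clamp at zero when callers cover more than `inclusive`.
    inclusive.saturating_sub(covered)
}
```

For a node with 1,500,000,000 inclusive instructions and a single incoming edge of 300, the function returns 1,499,999,700, so the node surfaces at nearly full weight; a node whose callers sum to exactly its inclusive cost gets zero and stays a non-root.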
Bump the fixture workload to compute(30) and gate the libpython obj-skip behind CLG_NO_SKIP_PYTHON so the topology-JSON test keeps its stable obj-skipped snapshot while a new python_flamegraph test renders the fixture with the interpreter frames intact. The latter shows the real fib recursion (_PyEval_EvalFrameDefault and the PyLong/frame helpers) instead of the graph folding entirely into (below main).
Revert the no-skip gate: obj-skipping libpython is the real pytest-codspeed scenario, so render the flamegraph from the obj-skipped run (raw graph, not redacted). With the uncovered-budget root fix the folded output is cost-faithful: (below main) holds the full ~1.5B collected instead of being dropped. Keeps compute(30).
Under --instr-atstart=no the measured region begins inside already-obj-skipped libpython, so the whole call tree folds into (below main) and the flamegraph is one uninformative bar. Profiling the flamegraph run with --instr-atstart=yes captures the stack from process start, so the interpreter's fib recursion (_start -> main -> Py_RunMain -> ... -> _PyEval_EvalFrameDefault and the PyLong/frame helpers) is visible. The topology test keeps --instr-atstart=no for its stable obj-skipped snapshot.
…uginfo-path
find_debug_file() only checked the hardcoded /usr/lib/debug/.build-id path for build-id-only debug objects (no .gnu_debuglink), which never exists on NixOS. --extra-debuginfo-path was also never consulted for build-id lookups, only for the debugname/debuglink branch.
Add try_buildid_dir() and try each colon-separated NIX_DEBUG_INFO_DIRS entry, then --extra-debuginfo-path, as <dir>/.build-id/xx/yyyy.debug before falling back to the FHS path.
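The lookup order can be illustrated with a Rust sketch; the real change is in Valgrind's C code, and `buildid_candidates` is a hypothetical helper, not `try_buildid_dir()` itself. It assumes the build ID is a lowercase hex string whose first two characters form the `xx` directory and whose remainder forms `yyyy`.

```rust
use std::path::PathBuf;

/// FHS fallback location named in the commit message.
const FHS_DEBUG_DIR: &str = "/usr/lib/debug";

/// Lists candidate debug-file paths in the order they would be tried:
/// each NIX_DEBUG_INFO_DIRS entry, then the extra path, then the FHS path.
fn buildid_candidates(
    build_id_hex: &str,
    nix_debug_info_dirs: Option<&str>,
    extra_debuginfo_path: Option<&str>,
) -> Vec<PathBuf> {
    // Need at least "xx" plus one more character, and ASCII so split_at is safe.
    if build_id_hex.len() < 3 || !build_id_hex.is_ascii() {
        return Vec::new();
    }
    let (head, tail) = build_id_hex.split_at(2);
    let in_dir = |dir: &str| {
        PathBuf::from(dir)
            .join(".build-id")
            .join(head)
            .join(format!("{tail}.debug"))
    };

    let mut out = Vec::new();
    // Colon-separated list; empty entries are skipped.
    for dir in nix_debug_info_dirs
        .unwrap_or("")
        .split(':')
        .filter(|d| !d.is_empty())
    {
        out.push(in_dir(dir));
    }
    if let Some(dir) = extra_debuginfo_path {
        out.push(in_dir(dir));
    }
    out.push(in_dir(FHS_DEBUG_DIR));
    out
}
```

A caller would typically pass `std::env::var("NIX_DEBUG_INFO_DIRS").ok().as_deref()` as the second argument and open the first candidate that exists.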
chain.svg et al. previously came only from the redacted CallGraph, so libc/ld frames always showed as ??? regardless of whether the debug symbols actually resolved. Render the SVG before redact() so it shows real symbol names for local inspection; the JSON snapshot still uses the redacted graph for cross-machine stability. Also ignore *.svg output, which was never tracked.
Incrementally builds the in-repo Callgrind (VEX -> coregrind -> callgrind) before the tests that exec ../vg-in-place run. Tracks the top-level callgrind/*.c and *.h sources via rerun-if-changed so edits trigger a relink, configures the tree on first build (requiring CAPSTONE_DIR from nix develop), and asserts the launcher, tool, and .in_place symlink exist afterward.
The cost-line parser required exactly `num_positions + num_events` tokens, but real Callgrind output uses `positions: instr line` (two position columns) and omits trailing zero event counts, so cost lines are variable-length. Every real cost line was therefore rejected, leaving all self costs at zero and the flamegraph empty for actual profiles (rust/cpp/node samples all folded to 0). Read the first event (Ir) at token index `num_positions`, accepting 1..=num_events trailing counts, and validate the leading tokens as Callgrind position tokens (`*`, `0x..`, absolute, or `+N`/`-N`) to keep rejecting colon headers.
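A minimal sketch of the relaxed cost-line rule follows. The function names are hypothetical and the crate's parser may differ in detail: leading tokens must look like position tokens, between 1 and `num_events` counts may follow, and the first count is taken as the first event column (e.g. Ir).

```rust
/// True for the position forms listed above: `*`, `0x..`, absolute
/// decimal, or a relative `+N` / `-N`.
fn is_position_token(tok: &str) -> bool {
    if tok == "*" {
        return true;
    }
    if let Some(hex) = tok.strip_prefix("0x") {
        return !hex.is_empty() && hex.bytes().all(|b| b.is_ascii_hexdigit());
    }
    let digits = tok
        .strip_prefix('+')
        .or_else(|| tok.strip_prefix('-'))
        .unwrap_or(tok);
    !digits.is_empty() && digits.bytes().all(|b| b.is_ascii_digit())
}

/// Returns the first event count of a cost line, or None if the line is
/// not a cost line (for example a `fn=` spec or an `events:` header).
fn first_event(line: &str, num_positions: usize, num_events: usize) -> Option<u64> {
    let tokens: Vec<&str> = line.split_whitespace().collect();
    // Trailing zero counts may be omitted, so accept 1..=num_events counts.
    let trailing = tokens.len().checked_sub(num_positions)?;
    if trailing == 0 || trailing > num_events {
        return None;
    }
    if !tokens[..num_positions].iter().all(|t| is_position_token(t)) {
        return None;
    }
    tokens[num_positions].parse().ok()
}
```

With `positions: instr line` and three declared events, the line `0x401000 12 7` yields 7 even though two event columns are missing, while `events: Ir Dr` is still rejected because `events:` is not a position token.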
Folding expanded every root-to-leaf path; on a real graph with heavily-shared subtrees (a Node/V8 profile: ~9.5k nodes, 30k edges) this blew up exponentially and never terminated. Prune any branch whose budget falls below a small fraction of the total. Because budget is conserved and splits across a node's children, a relative floor bounds the surviving paths to ~1/fraction, so the same profile now folds in ~70ms. Small graphs are unaffected (the absolute 1-instruction floor still dominates).
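The effect of the relative floor can be demonstrated on a toy graph. This sketch makes simplifying assumptions that the real folder does not: the graph is acyclic, budget is split evenly across children rather than in proportion to call inclusive cost, and `diamond_chain` and `surviving_paths` are hypothetical helpers.

```rust
/// Builds a chain of `levels` diamonds: each top node calls two helpers
/// that both call the next top node, so full expansion has 2^levels paths.
fn diamond_chain(levels: usize) -> Vec<Vec<usize>> {
    let mut children = vec![Vec::new(); 3 * levels + 1];
    for i in 0..levels {
        let top = 3 * i;
        children[top] = vec![top + 1, top + 2];
        children[top + 1] = vec![top + 3];
        children[top + 2] = vec![top + 3];
    }
    children
}

/// Counts the paths that survive when any branch whose share of the
/// budget drops below `floor` is pruned. A path ends at a leaf or at
/// the node where its children were pruned.
fn surviving_paths(children: &[Vec<usize>], node: usize, budget: u64, floor: u64) -> u64 {
    let kids = &children[node];
    if kids.is_empty() {
        return 1;
    }
    // Assumption: even split; budget is conserved up to integer rounding.
    let share = budget / kids.len() as u64;
    if share < floor {
        return 1;
    }
    kids.iter()
        .map(|&k| surviving_paths(children, k, share, floor))
        .sum()
}
```

With a total budget of `1 << 20` and a floor of `1 << 10` (a fraction of 1/1024), a 40-level chain yields 1024 surviving paths instead of 2^40, matching the ~1/fraction bound. With a floor of 1, a 12-level chain still expands to all 4096 paths, so small graphs are not pruned.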
…start
Pure-compute port of the CodSpeed fractal benchmark: a rich recursive call graph (build/hash/sum/max-path/count/leaves + memoized fib + multi-pass analysis) that fires the Callgrind client requests several frames deep (main -> run_benchmark -> warmup -> run_measured), exercising the shadow-stack seeder. Integer arithmetic and a static node pool keep the graph free of libc/libm frames so the snapshots are stable across platforms. Wired into both fixture_canonical_json and fixture_full_trace.
No description provided.